INTELLIGENT CONSTRUCTION OF APPLICATION-ORIENTED DATABASES
V.V.Kluev, S.Y.Garnaev
Saint Petersburg State University, St. Petersburg Institute of Fine Mechanics and Optics
Abstract. This paper describes a method for the automated construction of subject-oriented databases on the Internet using a robot program (crawler). The operating principles of the crawler are presented, together with the procedure for tuning it to a subject area and its interaction with a human expert who selects relevant documents. Experimental results for a number of subject areas are analyzed.

Nowadays there are many specialized subject servers on the Internet, and tremendous financial, human, and time resources have been spent to build their thematic collections. Traditionally, the following schema is used to construct a specialized server: an expert manually searches for relevant information on the Internet. This process takes a long time. For example, the database for the WWW server 'Physics in the Internet', supported by the Russian Basic Research Foundation (project 96-07-89149, http://www.physics.nw.ru), has taken three years to build and contains 3000 documents.

The traditional schema can be altered if the results of the project 'Open Architecture Server for Information Search and Delivery' (OASIS) are used. The expert responsible for some part of the database forms a collection from documents that already exist in the database. This collection is a core of the most relevant documents on the given topic; all documents in it are in the same language. For a new topic, the expert has to find a core of documents on the Internet in the traditional manner. The Crawler's task is then to find similar documents on the Internet. The documents the Crawler retrieves as relevant are passed to the expert, who makes the final decision on whether they are worth including in the database.
The Crawler is a program that recursively retrieves documents from the Internet and checks each retrieved document for relevance to the database topic. For each document a score is calculated according to the Crawler's filter, which describes the database topic: the filter contains terms with associated weights, plus a threshold. The threshold is a positive number used to decide whether a document is relevant enough to the database. The relevance score is calculated from the document's term-frequency profile, and a document is considered relevant to the database if its score is greater than the threshold. From each retrieved document the Crawler extracts URLs (Uniform Resource Locators) and adds them to a queue of found URLs, which is kept sorted by the score of the document each URL was found in. The Crawler takes URLs from this queue to retrieve further documents.
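As an illustration of this scoring-and-queueing loop, the following is a minimal sketch in Python, assuming a simple linear filter over raw term frequencies; the terms, weights, threshold value, and function names are illustrative and are not taken from the OASIS implementation.

    import heapq

    # Hypothetical topic filter: terms with weights, plus a relevance threshold.
    FILTER = {"retrieval": 2.0, "index": 1.5, "query": 1.2, "ranking": 1.0}
    THRESHOLD = 3.0  # documents scoring above this are considered relevant

    def score(document_text: str) -> float:
        """Score a document by summing filter weights over its term occurrences."""
        return sum(FILTER.get(word, 0.0) for word in document_text.lower().split())

    # Queue of found URLs, ordered by the score of the document they came from.
    # heapq is a min-heap, so scores are negated to pop the best URL first.
    queue: list[tuple[float, str]] = []

    def enqueue_links(source_score: float, urls: list[str]) -> None:
        for url in urls:
            heapq.heappush(queue, (-source_score, url))

    def next_url() -> str:
        return heapq.heappop(queue)[1]

    # Example: links extracted from a relevant document enter the queue.
    doc = "an index for query ranking in information retrieval"
    s = score(doc)  # 1.5 + 1.2 + 1.0 + 2.0 = 5.7, above the threshold
    if s > THRESHOLD:
        enqueue_links(s, ["http://example.org/a", "http://example.org/b"])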
The OASIS Crawler has been implemented as part of the OASIS project [1, 2] and is currently in the alpha stage.
Below we describe the technology used in the experiments.
To test the technology described above, it was necessary to construct test databases consisting of real Web documents with human evaluations of the pages' topics. Since no standard test collection, such as the TREC collections, exists for real Internet data, several categories in the Yahoo! directory tree were selected, and the pages listed in those categories were used to form test databases. The following categories were selected: Benchmarks, Museums, Research Groups, Travelogues, Information Retrieval, Card Games, Programming Languages, Unix-Linux. Documents listed in a given Yahoo! category are assumed to be relevant only to that category's topic.
Table 1 shows the selected values of the parameters mentioned above: the database core contains 100-150 documents (M), the filter is formed from 1000 terms (N), and the threshold is set to provide a recall of 50% (K). The Crawler ran for 48-385 hours (H); Z, apparently the limit on the number of evaluated documents, was set to 100000.
Table 1. Parameter values

Parameter | Value
M | 100-150
N | 1000
K | 50%
H | 48-385
Z | 100000
Table 2 shows the results of the collection-construction experiments. The Crawler ran as a background process on a server at Dublin University (hardware: Pentium II, 330 MHz; software: Linux). The speed of collection gathering varied widely: on average, 4 documents were recommended for every 100 evaluated documents. The number of duplicate documents (i.e., documents gathered from mirrors) also varied from collection to collection.
The documents recommended for the Information Retrieval and Physics collections were manually analyzed in full. For the other collections, a sample of documents from each directory was analyzed, together with all documents that did not belong to any directory.
Manual expert analysis showed that on average approximately 61% of the recommended documents were actually junk, i.e. they did not belong to the given topic. In other words, about four of every ten documents recommended by the Crawler were correct, giving a precision of about 39% (precision is the fraction of the recommended documents that are actually relevant). The best result was shown by the Physics collection, with only about 20% junk.
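For reference, precision here appears to be computed over the unique recommended documents that were manually analyzed. Taking the fully analyzed Physics row of Table 2 (607 recommended documents minus 33 duplicates leaves 574 analyzed):

    precision = relevant documents / analyzed documents = 467 / 574 ≈ 0.81.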
Table 3 shows an interesting result. The Crawler was started twice for each of two collections, Information Retrieval and Card Games: once with a 1000-term filter and once with a 100-term one. The short filter was constructed manually by an expert, who selected 100 significant terms from the collection's dictionary to build the topic filter. Precision increased for both collections.
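The paper does not specify the criterion by which the 100 significant terms were chosen; a minimal sketch, assuming significance is approximated by filter weight (the name shorten_filter and the ranking rule are assumptions, since the actual selection was done manually by an expert), might look like this:

    def shorten_filter(full_filter: dict[str, float], n: int = 100) -> dict[str, float]:
        """Keep only the n highest-weighted terms of a topic filter."""
        top = sorted(full_filter.items(), key=lambda kv: kv[1], reverse=True)[:n]
        return dict(top)

Whatever the exact selection rule, Table 3 quantifies the effect: precision rose from 0.34 to 0.49 for Information Retrieval and from 0.2 to 0.32 for Card Games, while the Information Retrieval run took 47 hours instead of 385.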
Table 2. Construction of collections

Collection | Core | Time, hours | Evaluated docs | Recommended docs | Duplicate docs | Manually analyzed (dir. + file) | Relevant docs | Precision
Travelogues | 149 | 82 | ~8000 | 2883 | 62 | 553 + 338 | 226 (48 + 137) | 0.08
Research Groups | 100 | 88 | ~22500 | 3091 | 60 | 777 + 134 | 811 (372 + 45) | 0.29
Museums | 121 | 150 | ~75000 | 3631 | 444 | 1780 + 392 | 444 (227 + 39) | 0.14
Benchmarks | 110 | 44 | ~16000 | 1233 | 350 | 136 + 33 | 649 (85 + 24) | 0.73
Programming Languages | 121 | 40 | 12500 | 1004 | 22 | 331 + 88 | 445 (202 + 18) | 0.45
Information Retrieval | 110 | 385 | ~91000 | 594 | 14 | 580 | 202 | 0.34
Physics | 50 | 240 | 100000 | 607 | 33 | 574 | 467 | 0.81
Card Games | 110 | 10 | 10000 | 4170 | 114 | 134 + 129 | 798 (550 + 78) | 0.2
Table 3. The effect of reducing the filter length

Collection | Filter length | Time, hours | Evaluated docs | Recommended docs | Relevant docs | Precision
Information Retrieval | 1000 | 385 | 91000 | 594 | 202 | 0.34
Information Retrieval | 100 | 47 | < 30000 | 479 | 229 | 0.49
Card Games | 1000 | 10 | 10000 | 4150 | 798 | 0.2
Card Games | 100 | 10 | 10000 | 675 | 219 | 0.32
The OASIS Crawler can independently construct application-oriented databases of Internet documents on the basis of information-filtering technology and heuristics. Preliminary results have shown that this technology has promising performance characteristics. The hypothesis that reducing the filter length improves precision has to be confirmed in further experimental tests.